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Although the leave-subject-out cross-validation (CV) has been 
widely used in practice for tuning parameter selection for various 
0^ , nonparametric and semiparametric models of longitudinal data, its 

theoretical property is unknown and solving the associated optimiza- 
tion problem is computationally expensive, especially when there are 
, multiple tuning parameters. In this paper, by focusing on the penal- 

' ized spline method, we show that the leave-subject-out CV is opti- 

mal in the sense that it is asymptotically equivalent to the empirical 
squared error loss function minimization. An efficient Newton-type 
' algorithm is developed to compute the penalty parameters that op- 

timize the CV criterion. Simulated and real data are used to demon- 
strate the effectiveness of the leave-subject-out CV in selecting both 
the penalty parameters and the working correlation matrix. 



> 

' 1. Introduction. In recent years there has seen a growing interest in 

applying flexible statistical models for analyzing longitudinal data or the 
T^lj- I more general clustered data. Various semiparametric [e.g., Zeger and Dig- 

; gle (1994), Zhang et al. (1998), Lin and Ying (2001), Wang, Carroll and 

. Lin (2005)] and nonparametric [e.g.. Fan and Zhang (2000), Lin and Carroll 

£2 : (2000), Rice and Silverman (1991), Wang (1998, 2003), Welsh, Lin and Car- 

roll (2002), Zhu, Fung and He (2008)] models have been proposed and stud- 
ied in the literature. All of these flexible, semiparametric or nonparametric 
^ ' methods require specification of tuning parameters, such as the bandwidth 
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for the local polynomial kernel methods, the number of knots for regres- 
sion splines and the penalty parameter for penalized splines and smoothing 
splines. 

The "leave-subject-out cross-validation" (LsoCV) or more generally called 
"leave-cluster-out cross-validation," introduced by Rice and Silverman (1991), 
has been widely used as the method for selecting tuning parameters in 
analyzing longitudinal data and clustered data; see, for example. Hoover 
et al. (1998), Huang, Wu and Zhou (2002), Wu and Zhang (2006), Wang, 
Li and Huang (2008). The LsoCV is intuitively appealing since the within- 
subject dependence is preserved by leaving out all observations from the 
same subject together in the cross-validation. In spite of its broad accep- 
tance in practice, the use of LsoCV still lacks a theoretical justification to 
date. Computationally, the existing literature has focused on the grid search 
method for finding the minimizer of the LsoCV criterion (LsoCV score) 
[Chiang, Rice and Wu (2001), Huang, Wu and Zhou (2002), Wang, Li and 
Huang (2008)], which is rather inefficient and even prohibitive when there 
are multiple tuning parameters. The goal of this paper is twofold: First, we 
develop a theoretical justification of the LsoCV by showing that the LsoCV 
criterion is asymptotically equivalent to an appropriately defined loss func- 
tion; second, we develop a computationally efficient algorithm to optimize 
the LsoCV criterion for selecting multiple penalty parameters for penalized 
splines. 

We shall focus our presentation on longitudinal data, but all discussions in 
this paper apply to clustered data analysis. Suppose we have n subjects and 
subject i, i = 1, . . . , n, has observations (yjj, Xjj), j = 1, . . . , with yij being 
the jth response and Xjj being the corresponding vector of covariates. Denote 
Yi = iUii,- ■ ■,yini)'^ and Xj = (xji, . . • ,Xj„J. The marginal non- and semi- 
parametric regression model [Welsh, Lin and Carroll (2002), Zhu, Fung and 
He (2008)] assumes that the mean and covariance matrix of the responses 
are given by 

m 

(1) fijj = E{yij\'Xi) = Xjjo/3o + fkj^ijk), cov(yi|Xj) = Sj, 

k=l 

where (3q is a vector of linear regression coefficients, fk, k = 1, . . . ,m, are 
unknown smooth functions, and Sj's are within-subject covariance matrices. 
Denote = {fin, . . . , fiimY' ■ By using a basis expansion to approximate each 
fk, fJ-i can be approximated by fx^ ^ Xj/3 for some design matrix Xj and 
unknown parameter vector f3, which then can be estimated by minimizing 
the penalized weighted least squares 

n m 

(2) pl(/3) = ^(y, - Xi^)^W,ri(y, - X,/3) + ^ Afe/3^Sfc/3, 

1=1 k=l 
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where Wj's are working correlation matrices that are possibly misspecified, 
Sfc is a semi-positive definite matrix such that serves as a roughness 

penalty for /fc, and A = (Ai, . . . , Am) is a vector of penalty parameters. 

Methods for choosing basis functions, constructing the corresponding de- 
sign matrices Xj's and defining the roughness penalty matrices are well 
established in the statistics literature. For example, B-spline basis and ba- 
sis obtained from reproducing kernel Hilbert spaces are commonly used. 
Roughness penalty matrices can be formed corresponding to the squared 
second-difference penalty, the squared second derivative penalty, the thin- 
plate splines penalty or using directly the reproducing kernels. We refer to 
the books by Green and Silverman (1994), Gu (2002) and Wood (2006) for 
thorough treatments of this subject. 

The idea of using working correlation for longitudinal data can be traced 
back to the generalized estimating equations (GEE) of Liang and Zeger 
(1986), where it is established that the mean function can be consistently 
estimated with the correct inference even when the correlation structure 
is misspecified. Liang and Zeger (1986) further demonstrated that using 
a possibly misspecified working correlation structure W has the potential 
to improve the estimation efficiency over methods that completely ignore 
the within-subject correlation. Similarly, results have been obtained in the 
nonparametric setting in Welsh, Lin and Carroll (2002) and Zhu, Fung and 
He (2008). Commonly used working correlation structures include compound 
symmetry and autoregressive models; see Diggle et al. (2002) for a detailed 
discussion. 

In the case of independent data, Li (1986) established the asymptotic 
optimality of the generalized cross-validation (GCV) [Craven and Wahba 
(1979)] for penalty parameter selection by showing that minimizing the GCV 
criterion is asymptotically equivalent to minimizing a suitably defined loss 
function. To understand the theoretical property of LsoCV, we ask the fol- 
lowing question in this paper: What loss function does the LsoCV mimic 
or estimate and how good is this estimation? We are able to show that the 
unweighted mean squared error is the loss function that LsoCV is target- 
ing. Specifically, we obtain that, up to a quantity that does not depend 
on the penalty parameters, the LsoCV score is asymptotically equivalent 
to the mean squared error loss. Our result provides the needed theoretical 
justification of the wide use of LsoCV in practice. 

In two related papers, Gu and Ma (2005) and Han and Gu (2008) devel- 
oped modifications of the GCV for dependent data under assumptions on the 
correlation structure and established the optimality of the modified GCVs. 
Although their modified GCVs work well when the correlation structure is 
correctly specified up to some unknown parameters, they need not be suit- 
able when there is not enough prior knowledge to make such a specification or 
the within-subject correlation is too complicated to be modeled nicely with 
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a simple structure. The main difference between LsoCV and tfiese modified 
GCVs is that LsoCV utilizes working correlation matrices in the estimating 
equations and allows misspecification of the correlation structure. Moreover, 
since the LsoCV and the asymptotic equivalent squared error loss are not 
attached to any specific correlation structure, LsoCV can be used to select 
not only the penalty parameters but also the correlation structure. 

Another contribution of this paper is the development of a fast algorithm 
for optimizing the LsoCV criterion. To avoid computation of a large number 
of matrix inversions, we first derive an asymptotically equivalent approx- 
imation of the LsoCV criterion and then derive a Newton-Raphson type 
algorithm to optimize this approximated criterion. The algorithm is partic- 
ularly useful when we need to select multiple penalty parameters. 

The rest of the paper is organized as follows. Section 2 presents the main 
theoretical results. Section 3 proposes a computationally efficient algorithm 
for optimizing the LosCV criterion. Results from some simulation studies 
and a real data analysis are given in Sections 4 and 5. All technical proofs 
and computational implementations are collected in the Appendix and in 
the supplementary materials [Xu and Huang (2012)]. 

2. Leave-subject-out cross validation. Let /!{■) denote the estimate of the 
mean function obtained by using basis expansion of unknown functions /^'s 
{k= 1, . . . ,m) and solving the minimization problem (2) for /3. Let 
be the estimate of the mean function /i(-) by the same method but using all 
the data except observations from subject i, 1 <i <n. The LsoCV criterion 
is defined as 

1 " 

(3) LsoCV(V^, A) = - J]{y, -;iH](X,)}^{y, -;iH](X,)}. 

i=l 

By leaving out all observations from the same subject, the within-subject 
correlation is preserved in LsoCV. Before giving the formal justification of 
LsoCV, we review a heuristic justification in Section 2.1. Section 2.2 defines 
the suitable loss function. Section 2.3 lists the regularity conditions and 
Section 2.4 provides an example illustrating how the regularity conditions 
in Section 2.3 can be verified using more primitive conditions. Section 2.5 
presents the main theoretical result about the optimality of LsoCV. 

2.1. Heuristic justification. The initial heuristic justification of LsoCV 
by Rice and Silverman (1991) is that it mimics the mean squared predic- 
tion error (MSPE). Consider some new observations (Xj,y*), taken at the 
same design points as the observed data. For a given estimator of the mean 
function /-f(-), denoted as /i(-), the MSPE is defined as 

n n 

MSPE = -Y,E\\y\ - /i(X,)f = - tr(5]) + - 5]i?||^(X,) - /i(X,)f . 

i=\ i=l 
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Using the independence between fL^~^\-) and yj, we obtain that 



i?{LsoCV(W, A)} = - tr(S) + i V EMX,) - /.HI (x,) 
n n ^-^ 



where S = diagjSi, . . . , S^}. When n is large, should be close to 

/i(-), the estimate that uses observations from all subjects. Thus, we expect 
£;{LsoCV(W, A)} to be close to the MSPE. 

2.2. Loss junction. We shall provide a formal justification of LsoCV by 
showing that the LsoCV is asymptotically equivalent to an appropriately 
defined loss function. Denote Y = (y^, . . . ,y^)"^, X = (X^, . . . ,X^)^, and 
W = diagjWi, . . . , W„}. Then, for a given choice of A and W, the mini- 
mizer of (2) has a closed-form expression 

(4) ^X^W-iX-ff;AfcSfc^ X^W-^Y. 

The fitted mean function evaluated at the design points is given by 

(5) A(X|Y,W,A) =X/3 = A(W,A)Y, 
where A(W,A) is the hat matrix defined as 

(6) A(W,A) = X^X^W-iX + ^AfcSfc^ X^W^^. 

From now on, we shall use A for A(W, A) without causing any confusion. 

For a given estimator /i(-) of /Lt(-), define the mean squared error (MSE) 
loss as the true loss function 

(7) L(A) = - ^{/i(X,) - //(Xi)r{A(X.) - /i(XO}. 

Using (5), we obtain that, for the estimator obtained by minimizing (2), the 
true loss function (7) becomes 

L(W, A) = i(AY - iifiPCi - fx) 

(8) 

= - A)^(I - A)n + -e'^A^Ae - - A^) Ae, 

n n n 

where /i = (/x(Xi)'^, . . . , /x(X„)'^)'^, e = Y — /x. Since i?(e|Xi, . . . , X„) = 
and Var(£|Xi, . . . ,X„) = S, the risk function can be derived as 

(9) RCW, A) = E{L(W, A)} = -fi^Cl - Af(I -A)fi + - trfA^AE). 

n n 
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2.3. Regularity conditions. This section states some regularity condi- 
tions needed for our theoretical results. Noticing that unless W = I, A is 
not symmetric. We define a symmetric version of A as A = W~-'^/^AW-'^/^. 
Let Cii be the diagonal block of A^ corresponding to the ith subject. With 
some abuse of notation (but clear from the context), denote by Amax(') and 
Amin(') tlie largest and the smallest eigenvalues of a matrix. The regularity 
conditions involve the quantity ^(S,W) = Aniax(^W~-'^)Amax(W), which 

— 1/2 

takes the minimal value Amax(^) when W = I or W = Xl. Let ei = Yl- Si 
and Uj be Tij X 1 vectors such that u^'uj = 1, i = 1, . . . ,n. 

Condition 1. For some K > 0, i?{(u?"ej)^} < K, i = 1, . . . ,n. 
Condition 2. (i) maxi<j<„{tr(Ajj)} = 0(tr(A)/n) = o(l); 

(ii) maxi<i<„{tr(Cii)} = o(l). 
Condition 3. ^CE,W)/n = o{R{W , X)). 
Condition 4. ^(5], W){n-i tr(A)}V{n-i tr(A^AS)} = o(l). 
Condition 5. Amax(W)Amax(W-i)0(n-2 tr(A)2) = o(l). 

Condition 1 is a mild moment condition that requires that each component 

—1/2 

of the standardized residual Bi = T,- Si has a uniformly bounded fourth 
moment. In particular, when £j's are from the Gaussian distribution, the 
condition holds with K = 3. 

Condition 2 extends the usual condition on controlling leverage, used 
in theoretical analysis of linear regression models. Note that {tr(Ajj)} can 
be interpreted as the leverage of subject i, measuring the contribution to 
the fit from data of subject i and tr(A)/n is the average of the leverages. 
This condition says that the maximum leverage cannot be arbitrarily larger 
than the average leverage or, in other words, there should not be any dom- 
inant or extremely influential subjects. In the special case that all subjects 
have the same design matrices, the condition automatically satisfies since 
tr(Ajj) = tr(A)/n for all i = 1, . . . ,n. Condition 2 is likely to be violated if 
the nj's are very unbalanced. For example, if 10% of subjects have 20 ob- 
servations and the rest of the subjects only have 2 or 3 observations each, 
then maxi<j<„{tr(Ajj)}/{n~^ tr(A)} can be very large. 

When rij's are bounded, any reasonable choice of W would generally 
yield a bounded value of the quantity .^(S,W), and condition 3 reduces 
to ni?(W, A) —7- oo, which simply says that the parametric rate of conver- 
gence of risk 0(n~^) is not achievable. This is a mild condition since we 
are considering nonparametric estimation. When nj's are not bounded, con- 
dition 3's verification should be done on a case-by-case basis. As a spe- 
cial case, recent results for the longitudinal function estimation by Cai and 
Yuan (2011) indicate that condition 3 would be satisfied in this particular 
setting if ^(S, W)/n* = 0(1) and nVn^s^ ^ or ^(S, W)/n* = o(l) and 
n* /n^/"^^ oo for some r > 1, where n* = (-Y^" i — l""*^ is the harmonic 
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mean of ni , . . . , n„ . This conclusion holds for both fixed common designs 
and independent random designs. 

Condition 4 essentially says that ^(S, W){7i~Hr(A)}2 =o(i?(W,A)). It 
is straightforward to show that the left-hand side is bounded from above by 
c(I]W-i)c(W){tr(A)/n}V{tr(A2)/n}, where c(M) = A^ax(M)/A^in(M) 
is the condition number of a matrix M. If n^'s are bounded, for choices 
of W such that SW~^ and W are not singular, to ensure condition 4 
holds it suffices to have that {tr(A)/n}^/{tr(A^)/n} = o(l). For regression 
splines (A = 0), this condition holds if p/n ^ where p is the number of 
basis functions used, since tr(A^) = tr(A) = p. For penalized splines and 
smoothing splines, we provide a more detailed discussion in Section 2.4. 

If the working correlation matrix W is chosen to be well-conditioned 
such that its condition number Amax ( W) / Amin ( W) is bounded, condition 5 
reduces to tr(A)/n 0, which can be verified as condition 4. 

Conditions 3-5 all indicate that a bad choice of the working correlation 
matrix W may deteriorate the performance of using the LsoCV. For exam- 
ple, conditions 3-5 may be violated when S~^W or W is nearly singular. 
Thus, in practice, it is wise to avoid using a working correlation W that is 
nearly singular. 

We do not make the assumption that n^'s are bounded. However, rii obvi- 
ously cannot grow too fast relative to the number of subjects n. In particular, 
if rij's are too large, Amax(^W~^) can be fairly large unless W ~ 5], and 
Amax(W) can be fairly large due to increase of dimensions of the working 
correlation matrices for individual subjects. Thus, conditions 3-5 implicitly 
impose a limit to the growth rate of nj. 

2.4. An example: Penalized splines with B-spline basis functions. In this 
section, we provide an example where conditions 3-5 can be discussed in 
a more specific manner. Consider model (1) with only one nonparametric 
covariate x and thus there is only one penalty parameter A. We further 
assume that all eigeinvalues of matrices W and 5]W^^ are bounded from 
below and above, that is, there exist positive constants ci and C2 such that 

Cl < Amin(W) < Ainax(W) < C2 and Ci < A„,in(SW-l) < Amax(5]W-l) < C2. 

Under this assumption, it is straightforward to show that conditions 3-5 
reduce to the following conditions. 

Condition 3'. nii(W, A) ^ oo as n ^> oo. 
Condition 4'. {n"Hr(A)}2/{n-4r(A2)} = o(l). 
Condition 5'. tr(A)/n = o(l). 

Using Lemmas 4.1 and 4.2 from Han and Gu (2008) and similar argu- 
ments, we have the following three inequalities: 

(10) tr{A(c2A, I)} < tr{A(A, W)} < tr{A(ciA, I)}, 

(11) tr{A2(c2A,I)} < tr{A2(A, W)} < tr{A2(ciA, I)} 
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and 

cic^Hl-A(c2A,I)} 

(12) 

< {I - A(A, W)f {I - A(A, W)} < C2C3{I - A(c2A,I)}, 

where C3 = exp{c2(l + (cj~^ — c^"*^)^ + (c^^ — c^^))}- These inequahties and 
the definition of the risk function -R(W, A) imply that we need only to check 
conditions 3'-5' for the case that W = I. In particular, (10)-(12) imply that 

C1C3 - A(C2A, I)} V + cl tr{A2(c2A, I)} 

< nR{W, A) < C2C3M^{I - A(ci A, I)} V + 4 tr{A2(ci A, I)}, 
and, therefore, to show condition 3', it suffices to show 
(13) A(A,I)}V^oo or tr{A2(A, I)} ^ 00 

as n — )■ 00. 

We now use existing results from the literature to show how to verify 
conditions 3'-5'. Note that the notation used in the literature of penalized 
splines and smoothing splines is not always consistent. To fix notation, we 
denote for the rest of this section that A* = X/N and A*(A*) = A(A,I), 
where N is the total number of observations from all subjects. 

Let r denote the order of the B-splines and consider a sequence of knots de- 
fined on the interval [a, b], a = t_^r-i) = ■ ■ ■ = to < h < ■ ■ ■ < tK„ < tK„+i = 
■ ■ ■ = tK„+r = b- Define B-spline basis functions recursively as 

i?,i(x)=r' ^^-f ^<*^+i' 

[0, otherwise, 

BjA^) = T^^^B,,r-l{x) + /^•+;^~/ i?,+l,.-l(x) 

for j = — (r — 1), . . . , Kn- When this B-spline basis is used for basis expansion, 
the jth row of Xj is X^^.^ = (-B_(^„i) ^.(xjj), . . . , BK„,rixij)), for j = l,...,ni 
and i = 1, . . . , n. When the penalty is the integrated squared qth derivative of 
the spline function with q<r — 1, that is, / (Z^"^^)^, the penalty term can be 
written in terms of the spline coefficient vector (3 as A/3^A^i?Aq/3, where R 

is a {Kn + r — q) x (i^„ + r — q) matrix with Rij = Bj^j.~q{x)Bi^r~q{x) dx 
and Aq is a matrix of weighted gth order difference operator [Claeskens, 
Krivobokova and Opsomer (2009)]. 

We make the following assumptions: (a) 5 = maxo<j<x„(ij+i — tj) is 
of the order 0{K~^) and 5/ mino<j<_R-„(tj+i — tj) < M for some constant 
M > 0; (b) sup^g[a^fe] \Qn{x) — Q{x)\ = o(J^^^), where Qn and Q are the 
empirical and true distribution function of all design points {xi, . . . ,xn}', 
(c) Kn = o{N). Define quantity K^ = (K„ + r-g)(A*ci)i/(2g) ^ith some con- 
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stant ci > depending on q and the design density. Claeskens, Krivobokova 
and Opsomer (2009) showed that, under above assumptions, if Kg < 1, 
tr{A*(A*)} and tr{A*2(A*)} are both of the order 0(K„) and - 
A*(A*)}V = 0{X*^NK^n'^ + NK-^^); Kg > 1, tr{A*(A*)} and tr{A*2(A*)} 
are of order 0(A*-i/(2<7)) ^nd fi'^il- A*{X*)y fi = 0{N\* + NKn^"). Using 
these results and the results following inequalities (10)-(12), it is straight- 
forward to show that if A* = (for regression splines), letting Kn — )■ oo and 
Kn/n — > is sufficient to guarantee conditions 3'-5', and if A* / (for pe- 
nalized splines), further assuming A* ^ and nX*^^^"^^^ oo ensures the 
validity of conditions 3'-5'. 

When Kg > 1, the asymptotic property of the penalized spline estimator 
is close to that of smoothing splines, where the number of internal knots 
Kn = N. In fact, as discussed in Han and Gu (2008), for smoothing splines, 
it typically holds that tr{A*(A*)} and tr{A*2(A*)} are of order 0(A*~i/'^) 
and A*(A*)}^/x = 0(A^A*) for some d > 1 as iV oo and A* -> 0; see 

also Craven and Wahba (1979), Li (1986) and Gu (2002). Therefore, if one 
has A* ^ and nA*^/'^ — )■ oo, conditions 3'-5' can be verified for smoothing 
splines. 

2.5. Optimality of leave- subject- out CV. In this subsection, we provide 
a theoretical justification of using the minimizer of LosCV(W, A) to select 
the optimal value of the penalty parameters A. We say that the working cor- 
relation matrix W is predetermined if it is determined by observation times 
and/or some other covariates. One way to obtain such W is to use some 
correlation function plugged in with estimated parameters. Naturally, it is 
reasonable to consider the value of A that minimizes the true loss function 
L(W, A) as the optimal value of the penalty parameters for a predetermined 
W. However, L(W, A) cannot be evaluated using data alone since the true 
mean function in the definition of -L(W, A) is unknown. One idea is to use 
an unbiased estimate of the risk function i?(W, A) as a proxy of L(W, A). 
Define 

(14) C/(W, A) = -Y'^il - Afil - A)Y + - tr(AS). 

n n 

It is easy to show that 

(15) UCW, A) - L(W, A) - = -fi^Cl - Afe - -{e'^Ae - tr(AS)}, 

n n n 

which has expectation zero. Thus, if E is known, C/(W, A) - e^e/n IS an 
unbiased estimate of the risk i?(W, A). Actually, the estimator is consistent, 
as stated in the following theorem. 

Theorem 2.1. Under conditions 1-4, for a predetermined W and a 
nonrandom X, as oo, 

L(W, A) - R{W, A) = Op{R{W, A)) 
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and 

UCW, A) - L(W, A) - = Op(L(W, A)). 

n 

This theorem shows that the function C/(W, A) — e'^e/n, the loss func- 
tion L(W, A) and the risk function i?(W, A) are asymptotically equivalent. 
Thus, if S is known, C/(W, A) - s'^e/n is a consistent estimator of the risk 
function and, moreover, ^7(W, A) can be used as a reasonable surrogate of 
L(W,A) for selecting the penalty parameters, since the e^e/n term does 
not depend on A. However, C/(W,A) depends on knowledge of the true co- 
variance matrix S, which is usually not available. The following result states 
that the LsoCV score provides a good approximation of ?7(W, A), without 
using the knowledge of S. 

Theorem 2.2. Under conditions 1-5, for a predetermined W and a 
nonrandom A, as oo, 

LsoCV(W, A) - U{W, A) = Op(L(W, A)) 

and, therefore, 

LsoCV (W, A) - L(W, A) - -e^e = oJLfW, A)). 

n 

This theorem shows that minimizing LsoCV(W, A) with respect to A is 
asymptotically equivalent to minimizing C/(W, A) and is also equivalent to 
minimizing the true loss function L(W, A). Unlike C/(W,A), LsoCV(W, A) 
can be evaluated using the data. The theorem provides the justification 
of using LsoCV, as a consistent estimator of the loss or risk function, for 
selecting the penalty parameters. 

Remark 1 . Although the above results are presented for selection of the 
penalty parameter A for penalized splines, the results also hold for selection 
of knot numbers (or number of basis functions) Kn for regression splines 
when A = and Kn is the tuning parameter to be selected. 

Remark 2. Since the definition of the true loss function (7) does not 
depend on the working correlation structure W, we can use this loss function 
to compare performances of different choices of W, for example, compound 
symmetry or autoregressive, and then choose the best one among several 
candidates. Thus, the result in Theorem 2.2 also provides a justification for 
using the LsoCV to select the working correlation matrix. This theoretical 
implication is also confirmed in a simulation study in Section 4.3. When 
using the LsoCV to select the working correlation matrix, we recommend to 
use regression splines, that is, setting A = 0, because this choice simplifies 
computation and provides more stable finite sample performance. 
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3. Efficient computation. In this section, we develop a computationally 
efficient Newton-Raphson-type algorithm to minimize the LsoCV score. 

3.1. Shortcut formula. The definition of LsoCV would indicate that it 
is necessary to solve n separate minimization problems in order to find the 
LsoCV score. However, a computational shortcut is available that requires 
solving only one minimization problem that involves all data. Recall that A 
is the hat matrix. Let An denote the diagonal block of A corresponding to 
the observations of subject i. 

Lemma 3.1 (Shortcut formula). The LsoCV score satisfies 
1 " 

(16) LsoCV(W^, A) = - V(yi - yif{Iu - Au)-^{Iu - AuyHyi - fi), 

n ^-^ 

i=l 

where In is a rii x m identity matrix, and yi = /i(Xj). 

This result, whose proof is given in the supplementary material [Xu and 
Huang (2012)], extends a similar result for independent data [e.g.. Green 
and Silverman (1994), page 31]. Indeed, if each subject has only one obser- 
vation, then (16) reduces to LsoCV = (1/n) X]"=i(yi — 2/j)^/ (1 — Om)^, which 
is exactly the shortcut formula for the ordinary cross-validation score. 

3.2. An approximation of leave-subject-out CV. A close inspection of the 
short-cut formula of LsoCV(W,A) given in (16) suggests that the evalu- 
ation of LsoCV(W, A) can still be computationally expensive because of 
the requirement of matrix inversion and the formulation of the hat ma- 
trix A. To further reduce the computational cost, using Taylor's expan- 
sion (Ijj — Aii)~'^ ~ Ijj -|- Ajj, we obtain the following approximation of 
LsoCV(W,A): 

1 2 

(17) LsoCV* (W, A) = - Y^(I - A)'^(I - A)Y + - V ej Aueu 

n n ^-^ 

i=l 

where is the part of e = (I — A)Y corresponding to subject i. The next 
theorem shows that this approximation is a good one in the sense that its 
minimization is asymptotically equivalent to the minimization of the true 
loss function. 

Theorem 3.1. Under conditions 1-5, for a predetermined W and a 
nonrandom A, as oo, we have 

LsoCV* (W, A) - L(W, A) - -e^e = Op(L{W, A)). 

n 

This result and Theorem 2.2 together imply that LsoCV* (W, A) and 
LsoCV(W, A) are asymptotically equivalent, that is, for a predetermined 
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W and a nonrandom A, LsoCV(W, A) -LsoCV*(W, A) = Op(L(W, A)). The 
proof of Theorem 3.1 is given in the Appendix. 

We developed an efficient algorithm to minimizing LsoCV*(W, A) with 
respect to A for a pre-given W based on the works of Gu and Wahba (1991) 
and Wood (2004). The idea is to optimize the log transform of A using the 
Newton-Raphson method. The detailed algorithm is described in the sup- 
plementary material [Xu and Huang (2012)] and it can be shown that, for 
LsoCV*(W, A), the overall computational cost for each Newton-Raphson 
iteration is 0{Np), which is much smaller than the cost of directly minimiz- 
ing LsoCV(W, A) {0{Np'^)) when the total number of used basis functions 
p is large. 

4. Simulation studies. 

4.1. Function estimation. In this section, we illustrate the finite-sample 
performance of LsoCV* in selecting the penalty parameters. In each sim- 
ulation run, we set n = 100 and = 5, i = 1, . . . ,n. A random sample is 
generated from the model 

(18) Vij = fiixi,i) + f2ix2,ij) +eij, j = l,...,5,i = l,...,100, 

where xi is a subject level covariate and X2 is an observational level covariate, 
both of which are drawn from Uniform(— 2, 2). Functions used here are from 
Welsh, Lin and CarroU (2002): 



where z = (x -|- 2)/4. The error term Sij 's are generated from a Gaussian dis- 
tribution with zero mean, variance o"^ and the compound symmetry within- 
subject correlation, that is. 



j,l = 1, . . . ,5, i,k = 1, . . . , 100. In this subsection, we take a = 1 and p = 0.8. 
A cubic spline with 10 equally spaced interior knots in [—2,2] was used for 
estimating each function. Functions were estimated by minimizing (2) with 
two working correlations: the working independence (denoted as Wi = I) 
and the compound symmetry with p = 0.8 (denoted as W2). Penalty pa- 
rameters were selected by minimizing LsoCV* defined in (17). The top two 
panels of Figure 1 show that the biases using Wi and W2 are almost the 
same, which is consistent with the conclusion in Zhu, Fung and He (2008) 



/i (x) = V^^-z) sin (^27r^^— ^ j , 
f2{x) = sin(8z - 4) + 2exp(-256(z - 0.5)^) 




(19) 




if i = j = k = I, 
i{i = k, j ^ I, 



otherwise 
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Fig. 1. Simulation results for function estimation based on 200 Monte Carlo runs. Func- 
tions are evaluated over 100 equally spaced grid points in [—2,2]. Top panels: estimated 
functions: solid — true functions; dashed — average of estimates using Wi ; dotted — aver- 
age of estimates using W2 (not distinguishable with dashed). Bottom panels: variance of 
estimated functions: solid — estimates using Wi; dashed — estimates using W2. 



that the bias of function estimation using regression sphnes does not depend 
on the choice of the working correlation. The bottom two panels indicate 
that using the true correlation structure W2 yields more efficient function 
estimation, and the message is more clear in the estimation of f2{x). 

4.2. Comparison with an existing method. Assuming that the structure 
of W is known up to a parameter 7 and the true covariance matrix 5] is 
attained at 7 = 70, Han and Gu (2008) proposed to simultaneously select 7 
and A by minimizing the following criterion: 

(20) V*(W, A) = log{Y^W^/2(I - kfW^/^Y/N} - 1 log | W| + -^^^ 

where N is the total number of observations. They proved that V* is asymp- 
totically optimal in selecting both the penalty parameter A and the correla- 
tion parameter 7, provided that the within subject correlation structure is 
correctly specified. In this section, we compare the finite sample performance 
of LsoCV* and V* in selecting the penalty parameter when the working cor- 
relation matrix W is given and fixed. 

We generated data using (18) and (19) as in the previous subsection and 
considered different parameters for the correlation matrix. In particular, we 



14 



G. XU AND J. Z. HUANG 



W CO 



5 



C 
_l 

> 
5 





o ^ ° 

i 1 X ^ 


1 

4- 




o -r — 

8 8 . . 


0.5 


0.6 0.7 0.8 0.9 1 




a 




° . I ! 


T 


^ - T : 

T 


T t 

o 8 



^ 00 

5 <o 



o 



o ^ 

VI 
_I 

O 

5 - 



o 



0.5 0.6 0.7 0.8 0.9 1 

a 



T ■ 



0.1 0.3 0.5 0.7 0.9 
P 



0.1 0.3 0.5 0.7 0.9 
P 



Fig. 2. Relative efficiency of LsoCV* to V* and to the true loss when the working 
correlation matrix is the same as the true correlation matrix. 



fixed p = 0.8 and varied the noise standard deviation a from 0.5 to 1; we also 
fixed £7 = 1 and varied p from —0.2 to 0.9. A cubic spline with 10 equally 
spaced interior knots was used for each unknown regression function. For 
each simulation run, to compare the effectiveness of two selection criteria for 
a given working correlation matrix W, we calculated the ratio of true losses 
at different choices of penalty parameters: -L(W, Av*)/i(W, AlsoCv) and 
L(W, Aopt)/^(W, AlsoCV*)' where Ay and AlsoCV* are penalty parameters 
selected by using V* and LsoCV*, respectively, and Aopt is obtained by 
minimizing the true loss function defined in (7) assuming the mean function 
/i(-) is known. 

In the first experiment, the true correlation matrix was used as the work- 
ing correlation matrix, denoted as Wi . This is the case that V* is expected 
to work well according to Han and Gu (2008). Results in Figure 2 indicate 
that performances of LsoCV* and V* are comparable for this case regard- 
less of values of a or p. In the second experiment, the working correlation 
structure was chosen to be different from the true correlation structure. 
Specifically, the working correlation matrix, denoted as W2, is a truncated 
version of (19) where the correlation coefficient between £ij^ and Eij^ is 
set to p if \ji — j2| = 1 and if \ji — j2| > 2. Results in Figure 3 show that 
LsoCV* becomes more effective than V* in terms of minimizing the true loss 
of estimating the true mean function /i(-) as a or p increases. These results 
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Fig. 3. Relative efficiency of LsoCV* to V* and to the true loss when the working 
correlation matrix is different from the true correlation matrix. 



are understandable since V* is applied to a situation that it is not designed 
for and its asymptotic optimality does not hold. Moreover, from the right 
two panels of Figures 2 and 3, we see that the minimum value of LsoCV* 
is reasonably close to the true loss function assuming the knowledge of the 
true function, as indicated by the conclusion of Theorem 3.1. 

4.3. Correlation structure selection. We conducted a simulation study to 
evaluate the performance of LsoCV* in selecting the working correlation ma- 
trix W. The data was generated using the model (18) with o" = 1, n, = 5 for 
alH = 1, . . . , n. In this experiment, both xi and X2 are set to be observational 
level covariates drawn from Uniform(— 2, 2). Four types of within-subject 
correlation structures were considered: independence (IND), compound sym- 
metry with correlation coefficient p (CS), AR(1) with lag-one correlation p 
(AR), and unstructured correlation matrix with pi2 = P23 = 0.8, pis = 0.3 
and otherwise (UN). Data were generated using one of these correlation 
structures and then the LsoCV* was used to select the best working cor- 
relation from the four possible candidates. A cubic spline with 10 equally 
spaced interior knots in [—2, 2] was used to model each unknown function 
and we set the penalty parameter vector A = 0. Table 1 summarizes the re- 
sults based on 200 simulation runs for each setup. We observe that LsoCV* 
works well: the true correlation structure is selected in the majority of times. 
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Table 1 

Simulation results for working correlation structure selection 



Selected structure 



n p True structure 


IND 


CS 


AR 


UN 


50 0.3 


IND 


97.0 


2.0 


1.0 







CS 


8.5 


78.0 


13.5 







AR 


13.5 


10.0 


76.5 







UN 


1.5 


1.5 


21.5 


75.5 


0.5 


IND 


96.5 


2.5 


1.0 







CS 


3.0 


78.5 


18.5 







AR 


4.0 


9.5 


86.5 







UN 


3.5 


4.0 


11.5 


81.0 


0.8 


IND 


98.5 


1.0 


0.5 







CS 


3.5 


74.0 


22.0 


0.5 




AR 


5.5 


21.0 


71.0 


2.5 




UN 


5.5 


1.0 


8.5 


85.0 


100 0.3 


IND 


95.0 


3.0 


2.0 







CS 


2.0 


84.5 


13.5 







AR 


3.5 


8.5 


88.0 







UN 





1.0 


13.5 


85.5 


0.5 


IND 


99.5 


0.5 










CS 


2.5 


81.0 


16.5 







AR 


1.0 


6.0 


93.0 







UN 


2.0 


0.5 


10.0 


87.5 


0.8 


IND 


99.0 


1.0 










CS 


2.5 


73.5 


24.0 







AR 


2.0 


20.0 


76.5 


1.5 




UN 


5.5 


2.0 


9.0 


83.5 


150 0.3 


IND 


98.5 


1.0 


0.5 







CS 


2.0 


85.0 


13.0 







AR 


2.5 


5.5 


92.0 







UN 








16.5 


83.5 


0.5 


IND 


100 













CS 


1.0 


81.5 


17.5 







AR 


2.5 


8.5 


89.0 







UN 


0.5 





12.0 


87.5 


0.8 


IND 


99.5 


0.5 










CS 


1.0 


78.0 


20.0 


1.0 




AR 


0.5 


18.5 


77.5 


3.5 




UN 


1.0 


2.0 


6.5 


90.5 


5. A real data example. As a subset from the Multi-center AIDS Co- 
hort Study, the data set includes the repeated measurements of CD4 cell 
counts and percentages on 283 homosexual men who became HIV-positive 
between 1984 and 1991. All subjects were scheduled to take their measure- 
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ments at semi-annual visits. However, since many subjects missed some of 
their scheduled visits, there are unequal numbers of repeated measurements 
and different measurement times per subject. Further details of the study 
can be found in Kaslow et al. (1987). 

Our goal is a statistical analysis of the trend of mean CD4 percentage 
depletion over time. Denote by tij the time in years of the jth measurement 
of the ith individual after HIV infection, by yij the ith individual's CD4 

percentage at time tij and by X^^^ the ith individual's smoking status with 
values 1 or for the iih individual ever or never smoked cigarettes, respec- 
tively, after the HIV infection. To obtain a clear biological interpretation, we 

(2) 

define to be the ith individual's centered age at HIV infection, which 
is obtained by the ith individual's age at infection subtract the sample av- 
erage age at infection. Similarly, the ith individual's centered pre-infection 

(3) 

CD4 percentage, denoted by X^ , is computed by subtracting the average 
pre-infection CD4 percentage of the sample from the ith individual's ac- 
tual pre-infection CD4 percentage. These covariates, except the time, are 
time-invariant. Consider the varying-coefficient model 

(21) y,, = (3o{tij) + )/3i(ti,-) + xP(32{tij) + Xf^p^itij), 

where /3o(i) represents the trend of mean CD4 percentage changing over time 
after the infection for a nonsmoker with average pre-infection CD4 percent- 
age and average age at HIV infection, and /3i(t), /32(t) and I3^{t) describe 
the time-varying effects on the post-infection CD4 percentage of cigarette 
smoking, age at HIV infection and pre-infection CD4 percentage, respec- 
tively. Since the number of observations is very uneven among subjects, we 
only used subjects with at least 4 observations. A cubic spline with k = 10 
equally spaced knots was used for modeling each function. We first used the 
working independence Wi = I to fit the data and then used the residuals 
from this model to estimate parameters in the correlation function 

7(ti; a, 9) = a + {1 — a) exp{—6u), 

where u is the lag in time and 0<a<l,^>0. This correlation function was 
considered previously in Zeger and Diggle (1994). The estimated parameter 
values are {a, 9) = (0.40,0.75). The second working correlation matrix W2 
considered was formed using j{u;a,9). We computed that LsoCV(Wi,0) = 
881.88 and LsoCV(W2,0) = 880.33, which implies that using W2 is prefer- 
able. This conclusion remains unchanged when the number of knots varies. 
To visualize the gain in estimation efficiency by using W2 instead of Wi, 
we calculated the width of the 95% pointwise bootstrap confidence intervals 
based on 1000 bootstrap samples, which is displayed in Figure 4. We can ob- 
serve that the bootstrap intervals using W2 are almost uniformly narrower 
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Fig. 4. Width of the 95% pointwise bootstrap confidence intervals based on 1000 bootstrap 
samples, using the working independence Wi (solid line) and the working correlation 
matrix 'W 2 (dashed line). 



than those using Wi , indicating higher estimation efficiency. The fitted co- 
efficient functions (not shown to save space) using W2 with A selected by 
minimizing LsoCV*(W2, A) are similar to those published in previous stud- 
ies conducted on the same data set [Wu and Chiang (2000), Fan and Zhang 
(2000), Huang, Wu and Zhou (2002)]. 

APPENDIX: TECHNICAL PROOFS 

This section is organized as follows. We first give three technical lemmas 
(Lemmas A.1-A.4) needed for the proof of Theorem 2.1. After proving The- 
orem 2.1, we give another lemma (Lemma A. 5) that facilitates proofs of 
Theorems 2.2 and 3.1. We prove Theorem 3.1 ffi'st and then proceed to the 
proof of Theorem 2.2. 

Let Amax(M) = Ai(M) > A2(M) > • • • > Ap(M) = Amm(M) be eigenvalues 
of the p X p symmetric matrix M. We present several useful lemmas. 

Lemma A.l. For any positive semi- definite matrices Mi and M2, 
(22) Ai(Mi)Ap(M2)<\(MiM2)<A,(Mi)Ai(M2), i = l,...,p. 

Proof. See Anderson and Das Gupta (1963) and Benasseni (2002). □ 
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Lemma A. 2. For any positive semi- definite matrices Mi and M2, 

(23) tr(MiM2) < Ama.(Mi) tr(M2). 

Proof. The proof is trivial, using the eigen decomposition of Mi. □ 

Lemma A. 3. Eigenvalues o/A^AI] and (I — A)^(I — A)S are hounded 
above by ^(S, W) = A„ax(5]W~^)Amax(W). 

Proof. Recall that A = W^i/^AW^^^ Yot AEA^, by Lemma A.l, 
Ai(A^AI]) = Ai(AWAW-i/25.^-i/2) 

< Ai(AWA)A„,ax(5:W"i) 

< Ai(A2)A„,ax(W)A„,ax(5:W-i) < e(5], W). 

The last inequality follows from the fact that maxj{Ai(A^)} < 1. Similarly, 
Ai((I- A)^(I- A)S) <C(S,W) follows from maxi{Ai((I - A)^)} < 1. □ 

Denote e = (ej^, . . . , e^)^, where e^'s are independent random vectors with 
length nj, S(ej) = and Var(e) = Ij for i = 1, . . . , n. For each i, define Zij = 
(u^ej)^ where u^Uj^ = 1 if j = /c and otherwise, j, = 1, . . . , n^. 

Lemma A. 4. If there exists a constant K such that E{zf-) < K holds 
for all j = 1, . . . ,ni, i = 1, . . . ,n, then 

n 

(24) Var(e^Be) < 2tr(BB^) + kY,MB*,)}\ 

1=1 

where B is any N x N matrix (not necessarily symmetric), Bjj is the ith 
{ni X ni) diagonal block of B and B*. is an "envelop" matrix such that 
B*^ lb (Bjj -|- B^)/2 are positive semi-definite. 

The proof of this lemma is given in the supplementary material [Xu and 
Huang (2012)]. 

Proof of Theorem 2.1. In light of (9) and (15), it suffices to show 
that 

(25) L(W,A)-/2(W,A)=0p(iZ(W,A)), 

(26) -n^{l-Afe = 0p{R{W,X)), 

n 

(27) -{e^Ae-tr(AI])} = Op(i?(W,A)) 
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because, combining (25)-(27), we have 



UCW, A) - L(W, A) - -e'^e = o„(L(W, A)). 

n 



We first prove (25). By (8), we have 

1 



(28) Var(L(W, A)) = ^ Var{£^A^Ae - 2/x^(I - A)^Ae}. 



Define B = ^^/^A^AS^/^. Then e^A^Ae = {i:-^/^e)^B{'S-^/^e). Since 
B is positive semi-definite, by applying Lemma A. 4 with e = B = 

5]^/2aTasV2 and B*. =Bii, we obtain 

(29) ^ Var(e^A^Ae) < ^tr{B') + ^f^MB,.)}' 

i=l 

for some K > as defined in Lemma A. 4. By Lemmas A. 2 and A. 3, under 
condition 3, we have 

2 t,(B2) < 2^max(A^AE) ^^^^T^S) 

(30) 

< M^^itr(A^AS) = o(ii2(W, A)), 
n n 

Recall that Cu is the ith diagonal block of A^. Then, under condition 2(ii), 
tr(Cii)~o(l). Thus, 

tr(B,i) = tr(LiI]i/2w-^/2AWAW-i/2l]^/2Lf ) 

< A^ax(W) tr(AW-i/2l]i/2Lf Li5]i/2w-i/2 A) 

(31) = A„,ax(w) tT{CiiW;'/^^iW;'/^) 

max 

(SjW. ^)tr(Cn) 

= o(l)e(S,W). 

Since X]"=i{tr(Bjj)} = tr(B) =tr(A-^A5]), under condition 3, 

^ V^/f rPi U2 .-,^^e(5],W)tr(B) 
2^|tr(Bii)| =o(l) 

(32) 



o(l)^ii^l^ltr(A^AS) = o(i22(w,A)). 

n n 



Combining (29)-(32), we obtain 



^Var(e^A^Ae) -o(/?2(w,A)). 
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Var{/x^(I - A)^A£} = ^/^^(I - A)^ASA^(I - A)^l 

T Z ry-,Z 



Since Amax(A AS) < ^(S, W) by Lemma A. 3, under condition 3, 
^Var{/x^(I- Af A£} = ^ 

(33) 

,™V(I-A)-(I-A). 
= o(i?2(w,A)). 

Combining (28)-(33) and using the Cauchy-Schwarz inequality, we obtain 
Var(L(W,A)) = o(ii2(w,A)), which proves (25). 

To show (26), by Lemma (A. 3) and condition 3, we have 

^ Var{/x^(I - A)^£} = ^/x^(I - A)^S(I - A)/x 



<^^^^5£lMV(I-A)^(I-A)/. 



n n 

< ii^^i;,^(I - A)^(I - A)/x = o(i?2(w. A)). 

n n 

The result follows from an application of the Chebyshev inequality. 

To show (27), applying Lemma A.4 with e = 'E-'^/'^e, B = S^/^AE^/^. 
For each Bu = ^1^^ Au'E]/'^ , noticing that (W^^^^ - aWr^/2)Aii(w|/^ - 

— 1/2 

aW^ ) is positive semi-definite, we can define an "envelop" matrix as 
B*, = li:y\w]^^ X AuWy^/ai + aiW-'/^AuWr'/^)^'/^ for any > 0. 
Then by Lemma A.4, we obtain 

^ Var(e^Ae) = 4TVar(e^Be) 

(34) 

<Atr(BB^) + |f;{tr(B:J}^ 

i=l 

where K is as in Lemma A.4. By Lemma A. 2, under condition 3, we have 
4jtr(BB^) = ^tr(5]AEA^) < ^^HlSiMl tr(A^A5]) 

< M^^itr(A^AS) = o{R^{W,X)). 

n n 

By using Lemma A.l repeatedly and taking ai = Amax(Wj), we have 
tr(B:j =tr(A,,I]y=^W,5]|/')/(2a,) + «^tr(A,,E;/'wril]|/2^/2 



22 G. XU AND J. Z. HUANG 

< A,^ax(S,Wri)Ai„ax(Wi)tr(A,,) 

<e(S,W)tr(AiO. 
Under conditions 2(i), 3 and 4, we have 

(35) ^X^{tr(B:,)}2<:^e'(S,W)0(n-2tr(A)2) = o(i?2(w,A)). 

i=l 

Therefore, combining (34)-(35) and noticing conditions 1-4, we have 

^Var(e^A£)~o(i22(w,A)), 

which leads to (27). □ 

To prove Theorem 2.2, it is easier to prove Theorem 3.1 first. The following 
lemma is useful for the proof of Theorem 3.1. 

Lemma A. 5. Let D = diagjDn, . . . , D„„} he a diagonal block matrix 
and D* = diag{D|]^, . . . , D*,„} he a positive semi- definite matrix such that 
D* ± (D + D^)/2 are positive semi-definite. In addition, Tin 's and 
D*. 's meet the following conditions: (i) maxi<j<„{tr(D*^Wj)} ~ 
Amax(W)0(n"itr(A));(ii) maxi<,<„{tr(DiiW,D^} ~ A^ax(W)0(n-2tr(A)2). 
Then, under conditions 1-5, we have 

4t Var{Y^(I - A)'^D(I - A)Y} = o{R\W, A)). 

The proof is given in the supplementary material [Xu and Huang (2012)]. 

Proof of Theorem 3.1. By Theorem 2.1, it suffices to show that 
LsoCV*(W, A) - C/(W, A) = Op{R{W, A)), 
which can be obtained by showing 

(36) £;{LsoCV*(W, A) - C/(W, A)}^ = Op{R^{W, A)). 
Hence, it suffices to show that 

(37) ^{LsoCV*(W,A)-[/(W,A)} = o(i?(W,A)) and 

(38) Var{LsoCV*(W, A) - [/(W, A)} = o(i?2(w. A)). 

Denote A^ = diag{Aii, . . . , A„„} and A^ = diag{Aii, . . . , A„„}. It fol- 
lows that Ad = W^V2 and n^itr(A^) = 0{n^'^tT{Af ) by condi- 
tion 2. Some algebra yields that 

LsoCV*(W, A) - UiW, A) = -Y'^(I - A)'^Arf(I - A)Y - - tr(AI]). 
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First consider (37). We have that 
^{LsoCV*(W, A) - U{W, A)} 

(39) = i/x^(I - Af{A, + Aj)(I - A)/x 

+ - tr{A^(Arf + Aj) AE} - - tr(AjAdS) - - tr(A2s). 

n n n 

We shall show that each term in (39) is of the order o(i?(W, A)). 

Condition 2 says that maxi<j<„tr(Ajj) = 0{n~^ tic{A)) = o(l). Using con- 
ditions 2 and 5, we have 

tr(A,, + Ajf = 2tr(A2, + A^^A^) 

= 2tr(A2. + AiiWiAiiW-i) 

< 2tr(A2.){l + A„,ax(Wri)A^ax(W,)} 

= A„,ax(W)A„,ax(W-l)0(n-2 ^j.(^)2) ^ ^^^^^ 

which implies that all eigenvalues of (A^-l- AJ) are of order o(l), and, hence, 
i/i^(I - A)^(A, + Aj)(I - A)ti = o(l)i/z^(I - A)^(I - A)/x = o(i?(W, A)), 

- tr{A^(Arf + Aj) AS} = o(l)i tr(A'^AI]) = o(R(W, A)), 
n n 

Under condition 4, the third term in (39) can be bounded as 

-tr(AjArfS) < A^ax(5^W-i)itr(Ay'wi/2Arf) 
n n 

1 



(40) <aS,W)-tr(A^) 

n 



= e(S, W)0(n-2 tr(A)2) = o{R{W, A)). 

1 /2 — 1 /2 1/2 

For the last term in equation (39), observe that (W. — OiW- )'S{W- — 

— 1/2 

OiW- ) is positive semi-definite for any aj. Taking a.j = Aniax(Wj), we 
have 

-tr(AlS) = -tr(AlW"V2swi/2) < max trjA^ (S* + Sf )} 

n n i<i<n 



< max tr{A2 (W|/'s,wy + «^W, '/'s,W. ^/')} 

l<i<n 



< max {A„iax(SiWr^)Arnax(W,)tr(AfJ} 

l<j<n 

<e(S,W)0(n-2tr(A)2) = o(i?(W,A)), 

~ 1 /2 1 /2 

where S* = W- ^ SiW^' . Equation (39) and thus (37) have been proved. 
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To prove (38), define D = and the corresponding "envelop" matrix 
D* = diag{D^]^, . . . ,D*„}, where the diagonal blocks are defined as D*j = 
i(Wi/2 X A,,w|/Vai + aiW-'/'A,,Wri/') with = X^^A^i), then since 

tr(A,,W,A^) = tr(A2.W0 < A^ax(Wi){tr(Ai,)}' and 

tr(D*,W,)<A„,ax(W,)tr(AiO, 

we have that maxi<j<„ tr(AjjWjA^) = Amax(W)0(n~^ tr(A)^) and that 
maxi<i<„tr(D*^Wj) = Amax(W)0(n~-'^ tr(A)) by condition 2. Under condi- 
tions 3-4, (38) follows from Lemma A. 5. □ 

Proof of Theorem 2.2. By Theorem 3.1, it suffices to show 
LsoCV(W,A) -LsoCV*(W,A) = Op(L(W,A)), 
which can be proved by showing that 

£;{LsoCV(W, A) - LsoCV*(W, A)}^ = Op{R^{W, A)). 
It suffices to show 

(41) E{LsoCV(W,A)-LsoCV*(W,A)} = o(i?(W,A)) and 

(42) Var{LsoCV(W, A) - LsoCV*(W, A)} = o{R^{W, A)). 

For each i = l,...,n, consider the eigen-decomposition An = PjAjP?^, 
where Pj is a rij x rii orthogonal matrix and Aj = diagjAji, • • • , Aj.„.}, Xij > 0. 
Using this decomposition, we have 

where A* is a diagonal matrix with diagonal elements (1 — Ajj)"^, j = 
l,...,nj. Since under condition 2 maxi<j<„.{Aij} ~ o(l), we have (1 — 
Xij)~^ = Y.kLo ^ij^ which leads to 

oo oo 

(I..-A,,)-^ = 5]P.Afpf = J]At 

fe=0 fc=0 

k=m ^ii — • • • , 



Define D(-) = diaglDl^), . . .,I^t^}, where fyf^ = EZm^u 
m = 1, 2, It follows that, for each i, 

tr(DSr)) = f; tr(A^J < {tr(A..)}'= = \''^ff^\ . 

7 — 7 — 1 — tr AjJ 

k=m k=m ^ ' 

Since condition 2(i) gives maxi<j<„ tr(Ajj) ~ 0(n~^ tr(A)), we obtain that 
(43) max tr(D5f^) =0(n~™ tr(A)'"), m = l,2,.... 

l<i<n 
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Some algebra yields 

LsoCV(W, A) - LsoCV*(W, A) = -Y^{I - Af(D^'^^ + - A)Y, 



n 



where d(i) = W-V2d(i) wd(i) W-V2 and d(2) = wV2d(2) w-V2. 
To show (41), note that 

£;{LsoCV(W, A) - LsoCV*(W, A)} 

(44) = -/i^(I - A)^d(i)(I - A)/2 + - tr{(I - Af'D^'^^I - A)S} 
n n 

+ -M^(I - A)^d(2) (I -A)ii + - tr{(I - A)^D(2) (I - A)S}. 

n n 

Using Lemmas A.l and A. 2 repeatedly and condition 5, we have 
Amax(D(i)) < A^ax(W)A„,ax(W-i)0(n-2tr(A)2) = o(l). 
Thus, the first terms (44) can be bounded as 

-/i^(i - a)^d(i)(i - A)^l = o{l)-^l^{l - A)^(i - A)^l = o{r{w, a)). 



n n 



Using Lemma A. 3, under condition 4 and (43), the second term of (44) can 
be bounded as 



-tr{(I- A)^d(i)(I - A)S} < e(S, W)-tr(DW2^ 
n n 

= e(5], W)0(n-2 tr(A)2) = o(i2(W, A)). 
Now consider the third term in (44). Under condition 5 and (43), 

tr{(Dr + or )n = 2tr(Dr ) + 2tr(Dr or ) 

= 2tr(D;f ) + 2tr(DSfw-iD;fw,) 

(45) 

< 2tr(DSf ') + 2A„,ax(Wri)A,,ax(Wi)tr(DSf ') 
= o(n-2 tr(A)2), 

which implies that all eigenvalues of D^^ -|-D^j are of the order 0(n~ tr(A)), 
and thus o(l). Then, under conditions 1-5, we have 

-/x^(I - A)^d(2)(I - A)/. = ^,.^(1 - A)^(d(2) + d(2)^)(I - A)M 
n In 

= o(l)-/x^(I - A)^(I - A)/2 = o(fi(W, A)). 
n 
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To study the the fourth term in (44), we have 

itr{(I- A)^d(2)(I- A)S} 

n 

1 " 

(46) =-Y^ tr{(In - Aii)^Df {lii - Au)^i} 

1=1 

1 " 1 
- i5:tr(A^D(f A,,S,) + itr(A^D(2)AE). 

To bound the first term in (46), we note that 

= \ tr{(I,, - A,,)^(w|/^D;f WrV2 ^ ^-i/2j3(2)^i/2^^j^^ _ ^^^^^^^^ 
which is bounded by 
\ tr{(I,, - A,,)^(w|/^D;f W|/V". + a^Wr^/^Dg) wrV2)(i^^ _ a,,)5]J 

< ie(S„ W,)tr(D;f ) + ^tr{(DS? - 2Djf )Wri/2s,Wri/2} 

+ |tr{DfA.wrV^5].wrV^A.} 

< ie(S,W){2 + A„,ax(Ai)}tr(DSf ) 
= o(/Z(W,A)), 

where we take = Aniax(Wj). The last equation follows from (43) and con- 
dition 4. Similarly, we can show that the second part of (46) is o(i?(W, A)). 
Consider the third part of (46), i tr(A^D(2) AS) = o(l)i tr(A^AS;) = 

o(-R(W, A)) since all eigenvalues of D^-^ + D are of the order o(l) as is 
shown in (45). Hence, (46) gives 

- tr{(I - A)^D(2) (I - A)S} = o{R{W, A)). 
n 

Therefore, (41) has been proved. 

Next, we proceed to prove (42). Define envelop matrices 

13(1)* ^ j3(i) 

and D(2r =diag{D;f ,...,DiT}, where D^f = ^{w'/^B^^^wy^ /a, + 
aiW~^^^ X D^f Wr^/2) with Oi = Amax(Wi). It is easy to check that D^^)* 
and D^^)* are valid envelops of D^-*^) and D^^^, respectively. Since under 
condition 5, we have 

tr(D«W,D«^) 

< A„ax(W)A^ax(W)A„,ax(W-i)ALx(Di' Vr(DSl^') 
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= {Amax(W)A„,ax(W-i)0(n^2 tr(A)2)}A^a.(W)0(n"2 tr(A)2) 
= A^ax(W)0(n-2tr(A)2), 
tr(DSP*Wi) < An,a.(Wi)tr(D;p) = A„,a.(W)0(n-2 tr(A)2) 

and 

tr(DS?W,DSf ^) < A^a.(W,Otr(D;f ) 

= A,nax(W)0(n-4tr(A)4) 
= A^ax(W)o(n-2tr(A)2), 

tr(Djfw,)<A^ax(W,)tr(Df ) 

= A„ax(W)0(n-2tr(A)2). 

By applying Lemma A. 5, we have 

\ Var{Y^(I - A)^D(™) (I - A)Y} = oJR^{W, A)), m = 1, 2, 
and (42) follows by the Cauchy-Schwarz inequality. □ 

SUPPLEMENTARY MATERIAL 

Efficient algorithm and additional proofs (DOT: 10. 1214/12- AOS1063SUPP; 
.pdf). In the Supplementary Material, we give a detailed description of the 
algorithm proposed in Section 3.2. In addition, proofs of some technical 
lemmas are also included. 
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